An unbiased kinship estimation method for genetic data analysis

Accurate estimate of relatedness is important for genetic data analyses, such as heritability estimation and association mapping based on data collected from genome-wide association studies. Inaccurate relatedness estimates may lead to biased heritability estimations and spurious associations. Individual-level genotype data are often used to estimate kinship coefficient between individuals. The commonly used sample correlation-based genomic relationship matrix (scGRM) method estimates kinship coefficient by calculating the average sample correlation coefficient among all single nucleotide polymorphisms (SNPs), where the observed allele frequencies are used to calculate both the expectations and variances of genotypes. Although this method is widely used, a substantial proportion of estimated kinship coefficients are negative, which are difficult to interpret. In this paper, through mathematical derivation, we show that there indeed exists bias in the estimated kinship coefficient using the scGRM method when the observed allele frequencies are regarded as true frequencies. This leads to negative bias for the average estimate of kinship among all individuals, which explains the estimated negative kinship coefficients. Based on this observation, we propose an unbiased estimation method, UKin, which can reduce kinship estimation bias. We justify our improved method with rigorous mathematical proof. We have conducted simulations as well as two real data analyses to compare UKin with scGRM and three other kinship estimating methods: rGRM, tsGRM, and KING. Our results demonstrate that both bias and root mean square error in kinship coefficient estimation could be reduced by using UKin. We further investigated the performance of UKin, KING, and three GRM-based methods in calculating the SNP-based heritability, and show that UKin can improve estimation accuracy for heritability regardless of the scale of SNP panel.

Proof of Property 1.
For the j-th single nucleotide polymorphism (SNP) (1 ≤ j ≤ m), let f j be the frequency of the reference allele (with label A) at that SNP. Consider a pair of individuals i and i whose kinship coefficient is denoted by φ ii , we derive the covariance of X ij and X i j from two different aspects. Recall that we denote ρ ii ,j to be the correlation between X ij and X i j , thus we have On the other hand, X ij can be treated as the sum of two independent Bernoulli random variables. That is, X ij = B ij(1) + B ij(2) . For k = 1, 2, B ij(k) = 0 if the k th allele for i at SNP j is a 1 if the k th allele for i at SNP j is A .
With this expression of X ij , we have As we denote f j to be the probability that a random allele chosen from the jth SNP is A. Notice that B ij(k) B i j(k ) = 1 only when the two alleles selected from i and i at this marker are both with label A, under this circumstance, these two reference alleles are either identical by descent (IBD) or not. For simplicity, let A ij (k) represent the k-th alleles from individual i at SNP j, if we assume IBD genes have the same allelic types and non-IBD genes have independent allelic types, we Consider the definitions of f j and φ ii , we obtain Substituting them into (A.2), we get As X ij is the sum of two i.i.d. Bernoulli random variables whose probability of success is f j , we can derive that σ 2 j = 2f j (1 − f j ). Together with (A.1) and (A.3), we have Equation (A.4) also reveals that the value of correlation ρ ii ,j doesn't depend on which SNP is selected, thus we get Proof of Property 2.
To demonstrate this property, we need a few preparations: i. Consider the result Cov(X ij , X i j ) = ρ ii σ 2 j , we have Directly applying this result yields (1)EX 2 ij = σ 2 j + µ 2 j . ( ii. Based on (1)-(3) stated above, we have With these preparations, we now work on the demonstration of Property 2.
Directly expand the expression on the left side of (A.2), we have

Substituting (A.5) and (A.6) into this expansion, we have
Thus we derive the conclusion in Property 2.

Proof of Property 3.
For ease of calculation, we make a complement to the value range of index i : We observe that Besides, if we change the sequence of summation, we have 1 2n Substituting them into the expansion, we get Thus we finish the proof of Property 3.

Proof of Property 4.
At the start, we focus on a part of the expression on the left side: Substituting (A.7), together with (A.9), into the whole expansion, we get Here we have proved the conclusion in Property 4.

robust GRM and two-step GRM
A general class of GRM estimators have the following form: where p = (p 1 , ..., p m ) T are the population frequencies of the reference alleles , a = (a 1 , ..., a m ) T are multiplicative factors, and w = (w 1 , ..., w m ) T are non-negative weights satisfying m j=1 w j = 1. The sample version is: where σ 2 j = 2p j (1−p j ) = V ar(X ij ), which can be replaced by a consistent estimator σ 2 j in practice. The scGRM estimator is a special case with a j =X j and w j = 1/m for j = 1, ..., m, which has been proved to be biased.
The robust GRM estimator is a special case with a j =X j and w j = σ 2 j /( m l=1 σ 2 l ) for all j, i.e.
Therefore, with Property 2 in the method section, we have Therefore, rGRM estimator is still biased. The global Day-Williams estimator is a special case with a j = 1 and w j = From the proof of property 2, we know Therefore, Therefore, the global Day-Williams estimator still has a negative bias. Write . Then under linkage equilibrium, we have . The author suggest one should first choose a j that minimizes V j (a j ), and then choose w minimize V ar(ρ ii ), which results in The two-step GRM estimator is then defined asρ T ii = m j=1ŵ j × Z j (â j ). Under the sample version: ( 1 n n a = 1 a = i ρ ia + 1 n n a = 1 a = i ρ ai ).
Because EZ j (â j ) doesn't depend on j, thus no matter how we chooseŵ, ( 1 n n a = 1 a = i ρ ia + 1 n n a = 1 a = i ρ ai )− 1 n , which indicates the two-step GRM estimator is still biased.